First, we must know what is the definition of the unemployment rate to understand how it is derived and what factors are related to it. A person is defined as unemployed in the United States if they are jobless, but have looked for work in the last four weeks and are available for work. To record unemployed, Government distributes survey to sampling population and predict the entire unemployed number in a broad area. Measuring the unemployment gives us a good overview of the ongoing status of the economy, international competition, technology development, and so on.
The equation of the Unemployment Rate is
\[unemployment \space rate = \frac{unemployed}{labor \space force}\times100\] where labor force includes all people age 16 and older who are classified as either employed and unemployed.
In this project, I will focus on analyzing and predicting the unemployment rate in the LA County.
unprocessed = read.csv("D:/UCSB/Spring_2022/PSTAT 131/PSTAT_131_HW/HW2/PSTAT-131/Final Project/data/unprocessed_data.csv")
head(unprocessed)
## DATE unemploy_rate_la avg_price_pipedgas_la avg_price_electr_kwh_La
## 1 1990/1/1 5.9 18.662 0.105
## 2 1990/2/1 5.6 19.971 0.108
## 3 1990/3/1 5.4 19.971 0.108
## 4 1990/4/1 5.5 19.976 0.108
## 5 1990/5/1 5.4 27.721 0.107
## 6 1990/6/1 5.4 27.712 0.108
## avg_price_gasolone_la civilian_labor_force_la_pch cpi_allitems_la
## 1 0.957 #N/A 1.1
## 2 0.988 0.2 1.1
## 3 1.014 0.1 0.7
## 4 1.030 -0.4 -0.2
## 5 1.080 0.4 0.3
## 6 1.103 0 0.3
## economics_cond_index_la unemployed_num_pch
## 1 NA #N/A
## 2 1.24 -4.3
## 3 0.58 -3.8
## 4 -0.35 1
## 5 -0.42 -0.9
## 6 -0.64 -0.5
## new_private_housing_structure_issue_la home_price_index_la
## 1 5121.609 0.7
## 2 4648.972 0.4
## 3 3628.443 0.2
## 4 3833.476 -0.1
## 5 3466.321 -1.1
## 6 3496.684 -0.3
## allemployee_nonfarm_la_pch allemployee_constr_la_pch allemployee_manu_la_pch
## 1 #N/A #N/A #N/A
## 2 0.1 -1.2 0
## 3 -0.1 -1.1 -0.4
## 4 -0.2 -3.7 -0.4
## 5 -0.2 -1.3 -0.4
## 6 -0.3 -0.9 -0.7
## allemployee_finan_la_pch allemployee_leisure_la_pch new_patent_ass_la
## 1 #N/A #N/A 202
## 2 -0.1 -0.4 218
## 3 -0.7 -0.1 290
## 4 -0.2 -0.6 256
## 5 -0.8 -0.1 262
## 6 -0.3 0.3 242
## govn_social_insu_pch compen_employee_wage_pch real_disp_inc_per_capital_pch
## 1 1.6 0.5 0.5
## 2 0.0 1.2 0.1
## 3 1.0 0.7 -0.1
## 4 0.0 0.9 0.5
## 5 0.3 -0.3 -0.2
## 6 1.3 0.8 0.0
## bbk_real_gdp us_interest_rate pers_consum_expen_pch pers_saving_rate
## 1 4.8776608 7 1.3 8.0
## 2 6.1814509 7 -0.1 8.6
## 3 2.9195562 7 0.7 8.3
## 4 -0.5634379 7 0.4 8.8
## 5 0.7507924 7 0.2 8.7
## 6 1.1771073 7 0.8 8.6
## pers_current_tax_chg govn_social_ben_toperson_pch federal_fund_eff_rate
## 1 -8.4 5.3 8.229032
## 2 8.1 -0.2 8.237143
## 3 5.2 0.7 8.276774
## 4 4.2 0.6 8.255000
## 5 0.6 -0.5 8.176452
## 6 3.9 1.2 8.288667
## X30_year_fixed_mortgage
## 1 1.54967
## 2 3.05710
## 3 0.69135
## 4 0.99338
## 5 1.03664
## 6 -2.99213
Although all variables are supposed to be numeric, but in fact some of them are imported as character variables. Also, we need to deal with missing value in columns contained “PCH” which means “Percent Change”.
unprocessed = unprocessed[-1,]
date = unprocessed[,1]
# delete NA in the first row
unprocessed = unprocessed[,-1] %>% mutate_if(is.character, as.numeric)
unprocessed$DATE = date
unprocessed = unprocessed %>%
select(DATE, everything())
# delete variables that have at least 8 missing values
processed_data = unprocessed %>% select(-avg_price_pipedgas_la, -new_patent_ass_la,
-cpi_allitems_la, -us_interest_rate,
-economics_cond_index_la) %>%
head(-2)
# no missing value
sum(sapply(processed_data, function(x) sum(is.na(x))))
## [1] 0
write.csv(processed_data, "D:\\UCSB\\Spring_2022\\PSTAT 131\\PSTAT_131_HW\\HW2\\PSTAT-131\\Final Project\\data\\processed_data.csv", row.names = FALSE)
# start with Feb since we delete the first row of unprocessed data
processed = ts(unprocessed[,-1],frequency = 12, start = c(1990,2))
autoplot.zoo(processed[,"unemploy_rate_la"])+
ggtitle("Unemployment Rate in LA County") +
xlab("Year") +
ylab("Percentage%")
We discovered that there are a few well-known recession periods from
1990 to now. The collapse of internet bubble, the financial crisis of
2007, and Covid-19 pandemic all matches severe increase of the
unemployment rate. We may study how different economics indexes
fluctuates during the financial crisis of 2007 which is mainly caused by
the mortgage debt.
Now we want to discover the seasonal pattern of the unemployment. Except the fluctuation during the Covid-19 seems abnormal, we discover that the unemployment rate usually peaks in summer. This is a problem requires further research.
ggseasonplot(processed[,"unemploy_rate_la"]) +
ggtitle("Seasonal Plot of Unemployment Rate in LA County") +
xlab("Year") +
ylab("Percentage%")
ggsubseriesplot(processed[,"unemploy_rate_la"]) +
ylab("$ million") +
ggtitle("Seasonal subseries plot: LA Unemployment Rate")
Net Migration
census_api_key("7540e4d61b8467521425225cbe8f44f7c1667f9a")
net_migration <- get_estimates(geography = "county", state = "CA",
variables = "RNETMIG",
year = 2019,
geometry = TRUE,
resolution = "20m") %>%
shift_geometry()
order = c("-15 and below", "-15 to -5", "-5 to +5", "+5 to +15", "+15 and up")
net_migration <- net_migration %>%
mutate(groups = case_when(
value > 15 ~ "+15 and up",
value > 5 ~ "+5 to +15",
value > -5 ~ "-5 to +5",
value > -15 ~ "-15 to -5",
TRUE ~ "-15 and below"
)) %>%
mutate(groups = factor(groups, levels = order))
state_overlay <- states(
cb = TRUE,
resolution = "20m"
) %>%
filter(GEOID != "72") %>%
shift_geometry()
ggplot() +
geom_sf(data = net_migration, aes(fill = groups, color = groups), size = 0.1) +
scale_fill_brewer(palette = "PuOr", direction = -1) +
scale_color_brewer(palette = "PuOr", direction = -1, guide = FALSE) +
labs(title = "Net migration per 1000 residents in CA",
subtitle = "US Census Bureau 2019 Population Estimates",
fill = "Rate") +
theme_minimal(base_family = "Roboto")
We found LA county is losing population this 5 years. It requires
further discussion how this trend will effect the unemployment rate.
Median Age
#median age
med_age <- get_acs(state = "CA", county = "Los Angeles", geography = "tract",
variables = "B01002_001", geometry = TRUE)
med_age %>%
ggplot(aes(fill = estimate)) +
geom_sf(color = NA) +
scale_fill_viridis_c(option = "magma")
All modeltime algorithm must include a date-time feature.
model_data = read.csv("D:/UCSB/Spring_2022/PSTAT 131/PSTAT_131_HW/HW2/PSTAT-131/Final Project/data/processed_data.csv")
I first use data from 1990-2016 to test whether these models can only use previous values of the unemployment rate and date to forecast in a satisfactory accuracy. Further, I want to utilize models which consider other economics indicators. Hopefully, I can create models that can first perceive economics situation in the US and then determine the trend of the unemployment rate.
Good forecasts capture the genuine patterns and relationships which exist in the historical data, but do not replicate past events that will not occur again. When forecasting time series data, the aim is to estimate how the sequence of observations will continue into the future. Therefore, the main concern may be only to predict what will happen, not to know why it happens.
data = model_data %>%
mutate(DATE, DATE = as.Date.character(DATE))
data = data %>% select(DATE, contains("la"))
data_2016 = data[1:320,] # I don't want to include pandemic
data_2017 = data[321:332,]
data_2016 %>% plot_time_series(DATE, unemploy_rate_la)
Splitting the data set and creating the training and testing set
splits <- initial_time_split(data_2016, prop = 0.9)
splits %>%
tk_time_series_cv_plan() %>%
plot_time_series_cv_plan(DATE, unemploy_rate_la)
train_2016 = training(splits)
test_2016 = testing(splits)
Correlation graph
train_2016 %>% select(contains("la")) %>%
cor() %>%
corrplot(type = "upper", tl.pos = "td",
method = "circle", tl.cex = 0.5, tl.col = 'black',
order = "hclust", diag = FALSE)
Here, we only compare the unemployment rate with economics indexes in LA
and discover that the unemployment rate is highly correlated with
average price of electricity, average price of
gasoline, and the number of new private housing structure
issue in LA. Thus, I may focus on these three predictor variables
when creating models.
\[S_t=\phi_{21}S_{t-1}+\phi_{22}S_{t-2} + \epsilon_t\] ACF plot is used to detect lagged features, fourier series periods, and data features via cycles. A time series may have a relationship to previous versions of itself. These are called lags. Then, the autocorrelation is introduced to measure the strength of the relationship to its lags. I want to use ACF plot to identify lags, which determines the recipe in my machine learning models.
# ACF Diagnostics
data %>%
plot_acf_diagnostics(DATA, unemploy_rate_la, .lags = 100)
From PACF graph, I find that lag 13 is still materially different as time passes. However, I may not utilize a long lag since I only want to predict the unemployment rate 2 or 3 months ahead. Therefore, lag 6 (half of a year) will be included in my lag recipe.
What is ARIMA? What are its parameters?
ARIMA is a simple algorithm that uses linear model to model lags. It performs automated differencing and recursive lag forecasting. Also, we can add fourier function to simulate seasonalities. However, ARIMA is very sensitive to number of lags and forecast can be erratic. Although regularization is not implemented, the parameter search might still cost lots of time.
According to the documentation of Auto ARIMA, it selects parameters based on which ever model yields the best In-sample AIC value. During refitting, if a new parameter set yields a lower AICc value, then the new model is selected.
# Auto ARIMA
model_fit_auto_arima <- arima_reg() %>%
set_engine(engine = "auto_arima") %>%
fit(unemploy_rate_la ~ DATE,
data = train_2016)
# w/ XREGS
model_fit_auto_arima_events <- arima_reg() %>%
set_engine(engine = "auto_arima") %>%
fit(unemploy_rate_la ~ DATE + avg_price_electr_kwh_La + avg_price_gasolone_la +
new_private_housing_structure_issue_la,
data = train_2016)
# Calibrate
calibration_tbl <- modeltime_table(model_fit_auto_arima,
model_fit_auto_arima_events) %>%
modeltime_calibrate(test_2016)
# Forecast test
calibration_tbl %>%
modeltime_forecast(
new_data = test_2016,
actual_data = data_2016
) %>%
plot_modeltime_forecast()
# Accuracy Test
calibration_tbl %>% modeltime_accuracy()
## # A tibble: 2 x 9
## .model_id .model_desc .type mae mape mase smape rmse rsq
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 ARIMA(4,0,1)(2,1,1)[12] Test 0.727 11.9 2.19 11.1 0.851 0.643
## 2 2 REGRESSION WITH ARIMA(1,1~ Test 0.333 5.43 1.00 5.24 0.388 0.961
Clearly, after adding three predictor variables in our model, we are able to forecast local peaks in the short time of period. Then, we can keep forecasting beyond 2016 data.By checking the real values, we found that the green line does a great job on forecasting the trend and the seasonality of the unemployment rate in LA county.
# Refit
refit_tbl <- calibration_tbl %>%
modeltime_refit(data_2016)
refit_tbl %>% modeltime_forecast(
new_data = data_2017,
actual_data = data_2016
) %>%
plot_modeltime_forecast(
.conf_interval_alpha = 0.05
)
changepoint_range: Adjusts the flexibility of the trend
component by limiting to a percentage of data before the end of the time
series.
changepoint_num: Number of potential change points to include
for modeling trend.
#Prophet
model_fit_prophet <- prophet_reg(
changepoint_num = 20,
changepoint_range = 0.9
) %>%
set_engine(engine = "prophet") %>%
fit(unemploy_rate_la ~ DATE, data = train_2016)
model_fit_prophet_xregs <- prophet_reg(
changepoint_num = 20,
changepoint_range = 0.9,
seasonality_yearly = 1
) %>%
set_engine(engine = "prophet") %>%
fit(unemploy_rate_la ~ DATE + avg_price_electr_kwh_La + avg_price_gasolone_la +
new_private_housing_structure_issue_la, data = train_2016)
modeltime_table(
model_fit_prophet,
model_fit_prophet_xregs
) %>% modeltime_calibrate(new_data = test_2016) %>%
modeltime_forecast(
new_data = test_2016,
actual_data = data_2016
) %>%
plot_modeltime_forecast(.interactive = TRUE)
The forecast graphs shows that the prophet model wrongly predicts the trend in the testing set. Therefore, we want to visualize the effect of key parameters on the prophet model to find out what happen. No matter what I adjust the change point range and number, the forecst graph does not change.
prophet_model <- model_fit_prophet$fit$models$model_1
prophet_fcst <- predict(prophet_model,
newdata = train_2016)
g <- plot(prophet_model, prophet_fcst) +
add_changepoints_to_plot(prophet_model)
ggplotly(g)
It seems that prophet model does not discover the change of trend after 2010 and predict the unemployment rate would keep growing. I may hypothesize that the growth of unemployment rate during Mortgage Crisis is so influential that the model ignores the slightly decrease afterward. Without detecting the decrease trend at the end of training set, even the XREGS cannot save the forecast. Although the prophet model may perform well if I extend the training set, this shortcoming makes me abandon Prophet model.
I will further test the Exponential Smoothing
I want to consider the lag when building the model
data_2016_full <- data_2016 %>%
bind_rows(
# add future window
future_frame(.data = ., .date_var = DATE, .length_out = 1)
) %>%
# add auto correlated lags 1 months
tk_augment_lags(unemploy_rate_la, .lags = 1) %>%
tk_augment_lags(unemploy_rate_la, .lags = 4) %>%
tk_augment_lags(unemploy_rate_la, .lags = 12) %>%
tk_augment_lags(contains("la"), .lags = 1) %>%
tk_augment_lags(contains("la"), .lags = 4) %>%
tk_augment_lags(contains("la"), .lags = 12)
data_2016_full
## # A tibble: 321 x 106
## DATE unemploy_rate_la avg_price_electr_kwh_La avg_price_gasolone_la
## <date> <dbl> <dbl> <dbl>
## 1 1990-02-01 5.6 0.108 0.988
## 2 1990-03-01 5.4 0.108 1.01
## 3 1990-04-01 5.5 0.108 1.03
## 4 1990-05-01 5.4 0.107 1.08
## 5 1990-06-01 5.4 0.108 1.10
## 6 1990-07-01 6.2 0.108 1.13
## 7 1990-08-01 6.2 0.108 1.28
## 8 1990-09-01 6.3 0.108 1.35
## 9 1990-10-01 6.2 0.109 1.42
## 10 1990-11-01 6.5 0.11 1.42
## # ... with 311 more rows, and 102 more variables:
## # civilian_labor_force_la_pch <dbl>,
## # new_private_housing_structure_issue_la <dbl>, home_price_index_la <dbl>,
## # allemployee_nonfarm_la_pch <dbl>, allemployee_constr_la_pch <dbl>,
## # allemployee_manu_la_pch <dbl>, allemployee_finan_la_pch <dbl>,
## # allemployee_leisure_la_pch <dbl>, unemploy_rate_la_lag1 <dbl>,
## # unemploy_rate_la_lag4 <dbl>, unemploy_rate_la_lag12 <dbl>, ...
new_splits <- initial_time_split(data_2016_full, prop = 0.9)
new_train_2016 <- training(new_splits)
new_test_2016 <- testing(new_splits)
Elastic Net Regression is very good at capturing trends, but we may not use it for complex patterns. Also, elastic net applies regularization to a linear regression
We know a linear regression tend to over-fit if we add many predictors
recipe_spec_base <- recipe(unemploy_rate_la ~., data = new_train_2016) %>%
step_timeseries_signature(DATE) %>%
# feature removal
step_rm(matches("(iso)|(xts)|(hour)|(minute)|(second)|(am.pm)|(day)|(week)")) %>%
# standardization
step_normalize(matches("(index.num)|(year)|(issue)")) %>%
# month feature is converted to dummy variables
step_dummy(all_nominal(), one_hot = TRUE)
# may later add interaction and fourier series features
recipe_spec = recipe_spec_base %>%
step_naomit(matches("lag"))
# spline
recipe_spec_1 <- recipe_spec_base %>%
step_rm(DATE) %>%
step_ns(ends_with("index.num")) %>%
step_rm(matches("lag"))
recipe_spec %>% prep() %>% juice() %>% glimpse()
## Rows: 259
## Columns: 123
## $ DATE <date> 1992-07-01, 19~
## $ avg_price_electr_kwh_La <dbl> 0.120, 0.120, 0~
## $ avg_price_gasolone_la <dbl> 1.393, 1.385, 1~
## $ civilian_labor_force_la_pch <dbl> 0.7, -0.2, -0.9~
## $ new_private_housing_structure_issue_la <dbl> -0.2055146, -0.~
## $ home_price_index_la <dbl> -0.9, -0.9, -0.~
## $ allemployee_nonfarm_la_pch <dbl> -0.2, -0.2, -0.~
## $ allemployee_constr_la_pch <dbl> -1.4, -0.9, -0.~
## $ allemployee_manu_la_pch <dbl> -0.4, -0.7, -0.~
## $ allemployee_finan_la_pch <dbl> -1.1, -1.0, 0.0~
## $ allemployee_leisure_la_pch <dbl> 2.0, -0.2, -0.1~
## $ unemploy_rate_la_lag1 <dbl> 10.2, 10.6, 10.~
## $ unemploy_rate_la_lag4 <dbl> 9.3, 8.9, 9.5, ~
## $ unemploy_rate_la_lag12 <dbl> 8.6, 8.4, 8.4, ~
## $ avg_price_electr_kwh_La_lag1 <dbl> 0.120, 0.120, 0~
## $ avg_price_gasolone_la_lag1 <dbl> 1.374, 1.393, 1~
## $ civilian_labor_force_la_pch_lag1 <dbl> 0.9, 0.7, -0.2,~
## $ new_private_housing_structure_issue_la_lag1 <dbl> -0.3499802, -0.~
## $ home_price_index_la_lag1 <dbl> -1.1, -0.9, -0.~
## $ allemployee_nonfarm_la_pch_lag1 <dbl> -0.4, -0.2, -0.~
## $ allemployee_constr_la_pch_lag1 <dbl> -1.5, -1.4, -0.~
## $ allemployee_manu_la_pch_lag1 <dbl> -0.5, -0.4, -0.~
## $ allemployee_finan_la_pch_lag1 <dbl> -0.1, -1.1, -1.~
## $ allemployee_leisure_la_pch_lag1 <dbl> -0.4, 2.0, -0.2~
## $ unemploy_rate_la_lag1_lag1 <dbl> 9.5, 10.2, 10.6~
## $ unemploy_rate_la_lag4_lag1 <dbl> 9.5, 9.3, 8.9, ~
## $ unemploy_rate_la_lag12_lag1 <dbl> 8.2, 8.6, 8.4, ~
## $ avg_price_electr_kwh_La_lag4 <dbl> 0.120, 0.119, 0~
## $ avg_price_gasolone_la_lag4 <dbl> 1.215, 1.242, 1~
## $ civilian_labor_force_la_pch_lag4 <dbl> -0.4, 0.1, -0.1~
## $ new_private_housing_structure_issue_la_lag4 <dbl> -0.4126281, -0.~
## $ home_price_index_la_lag4 <dbl> -0.7, -0.7, -0.~
## $ allemployee_nonfarm_la_pch_lag4 <dbl> -0.1, -0.2, -0.~
## $ allemployee_constr_la_pch_lag4 <dbl> 1.3, -0.5, -0.8~
## $ allemployee_manu_la_pch_lag4 <dbl> -0.4, -0.6, -0.~
## $ allemployee_finan_la_pch_lag4 <dbl> 0.3, -0.4, 0.5,~
## $ allemployee_leisure_la_pch_lag4 <dbl> -0.4, 0.5, -0.7~
## $ unemploy_rate_la_lag1_lag4 <dbl> 9.5, 9.3, 8.9, ~
## $ unemploy_rate_la_lag4_lag4 <dbl> 8.3, 8.4, 9.4, ~
## $ unemploy_rate_la_lag12_lag4 <dbl> 7.6, 7.4, 7.9, ~
## $ avg_price_electr_kwh_La_lag1_lag4 <dbl> 0.121, 0.120, 0~
## $ avg_price_gasolone_la_lag1_lag4 <dbl> 1.199, 1.215, 1~
## $ civilian_labor_force_la_pch_lag1_lag4 <dbl> 0.2, -0.4, 0.1,~
## $ new_private_housing_structure_issue_la_lag1_lag4 <dbl> -0.008603505, -~
## $ home_price_index_la_lag1_lag4 <dbl> -0.4, -0.7, -0.~
## $ allemployee_nonfarm_la_pch_lag1_lag4 <dbl> -0.2, -0.1, -0.~
## $ allemployee_constr_la_pch_lag1_lag4 <dbl> -3.9, 1.3, -0.5~
## $ allemployee_manu_la_pch_lag1_lag4 <dbl> -0.2, -0.4, -0.~
## $ allemployee_finan_la_pch_lag1_lag4 <dbl> -1.0, 0.3, -0.4~
## $ allemployee_leisure_la_pch_lag1_lag4 <dbl> 0.0, -0.4, 0.5,~
## $ unemploy_rate_la_lag1_lag1_lag4 <dbl> 9.4, 9.5, 9.3, ~
## $ unemploy_rate_la_lag4_lag1_lag4 <dbl> 8.2, 8.3, 8.4, ~
## $ unemploy_rate_la_lag12_lag1_lag4 <dbl> 7.6, 7.6, 7.4, ~
## $ avg_price_electr_kwh_La_lag12 <dbl> 0.117, 0.117, 0~
## $ avg_price_gasolone_la_lag12 <dbl> 1.126, 1.143, 1~
## $ civilian_labor_force_la_pch_lag12 <dbl> 0.9, -0.4, -0.1~
## $ new_private_housing_structure_issue_la_lag12 <dbl> -0.015576211, 0~
## $ home_price_index_la_lag12 <dbl> 0.2, -0.1, -0.4~
## $ allemployee_nonfarm_la_pch_lag12 <dbl> -0.2, -0.3, -0.~
## $ allemployee_constr_la_pch_lag12 <dbl> -0.1, -0.5, -1.~
## $ allemployee_manu_la_pch_lag12 <dbl> -0.7, -0.4, -0.~
## $ allemployee_finan_la_pch_lag12 <dbl> 0.6, -0.2, -0.6~
## $ allemployee_leisure_la_pch_lag12 <dbl> -0.1, 0.0, -0.2~
## $ unemploy_rate_la_lag1_lag12 <dbl> 8.2, 8.6, 8.4, ~
## $ unemploy_rate_la_lag4_lag12 <dbl> 7.6, 7.4, 7.9, ~
## $ unemploy_rate_la_lag12_lag12 <dbl> 6.2, 6.2, 6.3, ~
## $ avg_price_electr_kwh_La_lag1_lag12 <dbl> 0.117, 0.117, 0~
## $ avg_price_gasolone_la_lag1_lag12 <dbl> 1.173, 1.126, 1~
## $ civilian_labor_force_la_pch_lag1_lag12 <dbl> 1.0, 0.9, -0.4,~
## $ new_private_housing_structure_issue_la_lag1_lag12 <dbl> 0.145211149, -0~
## $ home_price_index_la_lag1_lag12 <dbl> 0.6, 0.2, -0.1,~
## $ allemployee_nonfarm_la_pch_lag1_lag12 <dbl> -0.3, -0.2, -0.~
## $ allemployee_constr_la_pch_lag1_lag12 <dbl> -1.1, -0.1, -0.~
## $ allemployee_manu_la_pch_lag1_lag12 <dbl> -0.5, -0.7, -0.~
## $ allemployee_finan_la_pch_lag1_lag12 <dbl> -0.3, 0.6, -0.2~
## $ allemployee_leisure_la_pch_lag1_lag12 <dbl> -0.2, -0.1, 0.0~
## $ unemploy_rate_la_lag1_lag1_lag12 <dbl> 7.9, 8.2, 8.6, ~
## $ unemploy_rate_la_lag4_lag1_lag12 <dbl> 7.6, 7.6, 7.4, ~
## $ unemploy_rate_la_lag12_lag1_lag12 <dbl> 5.4, 6.2, 6.2, ~
## $ avg_price_electr_kwh_La_lag4_lag12 <dbl> 0.118, 0.118, 0~
## $ avg_price_gasolone_la_lag4_lag12 <dbl> 1.012, 1.066, 1~
## $ civilian_labor_force_la_pch_lag4_lag12 <dbl> -0.1, -0.3, -0.~
## $ new_private_housing_structure_issue_la_lag4_lag12 <dbl> 0.263781427, -0~
## $ home_price_index_la_lag4_lag12 <dbl> -1.3, 0.2, 0.2,~
## $ allemployee_nonfarm_la_pch_lag4_lag12 <dbl> -0.4, 0.0, -0.4~
## $ allemployee_constr_la_pch_lag4_lag12 <dbl> -2.6, -0.4, 0.0~
## $ allemployee_manu_la_pch_lag4_lag12 <dbl> -0.8, -0.2, -0.~
## $ allemployee_finan_la_pch_lag4_lag12 <dbl> 0.0, -0.8, -0.2~
## $ allemployee_leisure_la_pch_lag4_lag12 <dbl> -0.2, 1.4, -0.2~
## $ unemploy_rate_la_lag1_lag4_lag12 <dbl> 7.6, 7.6, 7.4, ~
## $ unemploy_rate_la_lag4_lag4_lag12 <dbl> 6.5, 6.5, 7.3, ~
## $ unemploy_rate_la_lag12_lag4_lag12 <dbl> 5.4, 5.5, 5.4, ~
## $ avg_price_electr_kwh_La_lag1_lag4_lag12 <dbl> 0.118, 0.118, 0~
## $ avg_price_gasolone_la_lag1_lag4_lag12 <dbl> 1.108, 1.012, 1~
## $ civilian_labor_force_la_pch_lag1_lag4_lag12 <dbl> 0.9, -0.1, -0.3~
## $ new_private_housing_structure_issue_la_lag1_lag4_lag12 <dbl> 0.050017742, 0.~
## $ home_price_index_la_lag1_lag4_lag12 <dbl> -1.1, -1.3, 0.2~
## $ allemployee_nonfarm_la_pch_lag1_lag4_lag12 <dbl> -0.5, -0.4, 0.0~
## $ allemployee_constr_la_pch_lag1_lag4_lag12 <dbl> -1.7, -2.6, -0.~
## $ allemployee_manu_la_pch_lag1_lag4_lag12 <dbl> -0.8, -0.8, -0.~
## $ allemployee_finan_la_pch_lag1_lag4_lag12 <dbl> -0.2, 0.0, -0.8~
## $ allemployee_leisure_la_pch_lag1_lag4_lag12 <dbl> -0.6, -0.2, 1.4~
## $ unemploy_rate_la_lag1_lag1_lag4_lag12 <dbl> 7.3, 7.6, 7.6, ~
## $ unemploy_rate_la_lag4_lag1_lag4_lag12 <dbl> 6.2, 6.5, 6.5, ~
## $ unemploy_rate_la_lag12_lag1_lag4_lag12 <dbl> 5.6, 5.4, 5.5, ~
## $ unemploy_rate_la <dbl> 10.6, 10.4, 10.~
## $ DATE_index.num <dbl> -1.374937, -1.3~
## $ DATE_year <dbl> -1.380931, -1.3~
## $ DATE_half <int> 2, 2, 2, 2, 2, ~
## $ DATE_quarter <int> 3, 3, 3, 4, 4, ~
## $ DATE_month <int> 7, 8, 9, 10, 11~
## $ DATE_month.lbl_01 <dbl> 0, 0, 0, 0, 0, ~
## $ DATE_month.lbl_02 <dbl> 0, 0, 0, 0, 0, ~
## $ DATE_month.lbl_03 <dbl> 0, 0, 0, 0, 0, ~
## $ DATE_month.lbl_04 <dbl> 0, 0, 0, 0, 0, ~
## $ DATE_month.lbl_05 <dbl> 0, 0, 0, 0, 0, ~
## $ DATE_month.lbl_06 <dbl> 0, 0, 0, 0, 0, ~
## $ DATE_month.lbl_07 <dbl> 1, 0, 0, 0, 0, ~
## $ DATE_month.lbl_08 <dbl> 0, 1, 0, 0, 0, ~
## $ DATE_month.lbl_09 <dbl> 0, 0, 1, 0, 0, ~
## $ DATE_month.lbl_10 <dbl> 0, 0, 0, 1, 0, ~
## $ DATE_month.lbl_11 <dbl> 0, 0, 0, 0, 1, ~
## $ DATE_month.lbl_12 <dbl> 0, 0, 0, 0, 0, ~
recipe_spec_2 <- recipe_spec_base %>%
step_rm(DATE) %>%
step_naomit(matches("lag"))
model_spec_glmet <- linear_reg(
mode = "regression",
penalty = 0.01,
mixture = 0.5
) %>%
set_engine("glmnet")
# spline
wflw_fit_glmnet_spline <- workflow() %>%
add_model(model_spec_glmet) %>%
add_recipe(recipe_spec_1) %>%
fit(new_train_2016)
# lag
wflw_fit_glmnet_lag <- workflow() %>%
add_model(model_spec_glmet) %>%
add_recipe(recipe_spec_2) %>%
fit(new_train_2016)
calibration_tbl_2 <- modeltime_table(
wflw_fit_glmnet_spline,
wflw_fit_glmnet_lag
) %>%
update_model_description(1, "GLMNET - Spline") %>%
update_model_description(2, "GLMNET - Lag") %>%
modeltime_calibrate(new_test_2016)
calibration_tbl_2 %>% modeltime_accuracy()
## # A tibble: 2 x 9
## .model_id .model_desc .type mae mape mase smape rmse rsq
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 GLMNET - Spline Test 5.36 86.5 16.1 55.9 5.89 0.264
## 2 2 GLMNET - Lag Test 0.143 2.21 0.431 2.19 0.173 0.984
calibration_tbl_2 %>%
modeltime_forecast(
new_data = new_test_2016,
actual_data = data_2016_full
) %>%
plot_modeltime_forecast(.conf_interval_show = FALSE)
library(earth)
# Multivariate Adaptive Regression Spline model
model_spec_mars <- mars(mode = "regression") %>%
set_engine("earth")
#recipe_spec <- recipe(unemploy_rate_la ~ DATE, data = new_train_2016) %>%
#step_date(DATE, features = "month", ordinal = FALSE) %>%
#step_mutate(date_num = as.numeric(DATE)) %>%
#step_normalize(date_num) %>%
#step_rm(DATE)
wflw_fit_mars_spline <- workflow() %>%
add_model(model_spec_mars) %>%
add_recipe(recipe_spec_1) %>%
fit(new_train_2016)
wflw_fit_mars_lag <- workflow() %>%
add_model(model_spec_mars) %>%
add_recipe(recipe_spec_2) %>%
fit(new_train_2016)
wflw_fit_mars_simple <- workflow() %>%
add_recipe(recipe_spec) %>%
add_model(model_spec_mars) %>%
fit(new_train_2016)
calibration_tbl_3 <- modeltime_table(
wflw_fit_mars_spline,
wflw_fit_mars_lag,
wflw_fit_mars_simple
)
calibration_tbl_3 %>%
modeltime_forecast(
new_data = new_test_2016,
actual_data = data_2016_full
) %>%
plot_modeltime_forecast(.conf_interval_show = FALSE)
Add fitted models to a Model Table
models_tbl <- modeltime_table(
#model_fit_arima_no_boost,
model_fit_arima_boosted,
model_fit_ets,
model_fit_prophet,
model_fit_lm,
wflw_fit_mars
)
models_tbl
## # Modeltime Table
## # A tibble: 5 x 3
## .model_id .model .model_desc
## <int> <list> <chr>
## 1 1 <fit[+]> ARIMA(2,0,2)(1,1,1)[12] W/ XGBOOST ERRORS
## 2 2 <fit[+]> ETS(A,AD,A)
## 3 3 <fit[+]> PROPHET
## 4 4 <fit[+]> LM
## 5 5 <workflow> EARTH
Calibrate the model to a testing set
calibration_tbl <- models_tbl %>%
modeltime_calibrate(new_data = testing(splits))
calibration_tbl
## # Modeltime Table
## # A tibble: 5 x 5
## .model_id .model .model_desc .type .calibration_da~
## <int> <list> <chr> <chr> <list>
## 1 1 <fit[+]> ARIMA(2,0,2)(1,1,1)[12] W/ XGBOOS~ Test <tibble>
## 2 2 <fit[+]> ETS(A,AD,A) Test <tibble>
## 3 3 <fit[+]> PROPHET Test <tibble>
## 4 4 <fit[+]> LM Test <tibble>
## 5 5 <workflow> EARTH Test <tibble>
calibration_tbl %>%
modeltime_forecast(
new_data = testing(splits),
actual_data = data_2016
) %>%
plot_modeltime_forecast(
.legend_max_width = 25
)
calibration_tbl %>%
modeltime_accuracy() %>%
table_modeltime_accuracy(
.interactive = FALSE
)
| Accuracy Table | ||||||||
|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
| 1 | ARIMA(2,0,2)(1,1,1)[12] W/ XGBOOST ERRORS | Test | 0.72 | 12.68 | 2.18 | 11.22 | 1.01 | 0.67 |
| 2 | ETS(A,AD,A) | Test | 0.60 | 10.02 | 1.82 | 9.35 | 0.74 | 0.85 |
| 3 | PROPHET | Test | 6.37 | 100.78 | 19.17 | 64.20 | 6.60 | 0.52 |
| 4 | LM | Test | 2.35 | 38.82 | 7.07 | 30.58 | 2.67 | 0.00 |
| 5 | EARTH | Test | 0.64 | 10.57 | 1.94 | 9.80 | 0.75 | 0.94 |
It seems that